Probability distribution for binomial process:
\[
P(k | n, \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n - k}
\]
where
\(k\) successes in \(n\) trials; and
the probability of success on any trial is \(\theta\) .
Multiple experiments: for session \(i\) there are \(k_i\) hits from \(n_i\) spins.
At the session level the relationship between spins and hits can be modelled as a binomial process. For each session we are considering the number of spins (or “trials”) denoted by \(n\) and the number of hits (or “successes”) denoted by \(k\) . And \(\theta\) is the probability of success on any spin.
The binomial distribution allows us to calculate the probability of \(k\) given specific values for \(n\) and \(\theta\) . So this assumes that you know \(\theta\) … But what if you don’t? Certainly for most slot machines you don’t know the probability of winning. If you did then you probably wouldn’t be very tempted to play.
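As a concrete (and purely illustrative) sketch in Python: with a known hit rate of 30%, the probability of exactly 3 hits in 10 spins follows directly from the formula above. The counts and the 30% rate are invented for the example.

```python
from math import comb

def binomial_pmf(k, n, theta):
    """P(k | n, theta): probability of k successes in n trials."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Probability of exactly 3 hits in 10 spins when theta = 0.3.
print(round(binomial_pmf(3, 10, 0.3), 4))  # → 0.2668
```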
Suppose we had observations of \(k\) and \(n\) , and wanted to estimate \(\theta\) .
Probability distribution for Bernoulli process:
\[
P(k | \theta) = \theta^k (1 - \theta)^{1 - k}
\]
where
\(k = 0\) indicates failure;
\(k = 1\) indicates success; and
the probability of success on any trial is \(\theta\) .
A Bernoulli process is the atomic component of a binomial process and it considers a single trial at a time.
Bayes’ Theorem
\[
p(\theta|y, X) = \frac{p(y|X, \theta) \, p(\theta)}{p(y|X)} \propto p(y|X, \theta) \, p(\theta)
\] where
\(y\) are observations;
\(X\) are predictors;
prior — \(p(\theta)\) ;
likelihood — \(p(y|X, \theta)\) ; and
posterior — \(p(\theta|y, X)\) .
Our plan was to use Bayesian techniques to extract the parameters for these distributions.
Maud is quite familiar with these techniques but I had to remind myself of how it all works.
Bayes’ Theorem relates three important components: a prior, a likelihood and a posterior.
The prior specifies what we know about the model parameters before considering the data.
The likelihood gives the probability of the data conditional on the model and choice of parameter.
The posterior is what we know about the model parameters after factoring in the data.
A powerful feature is that this can be applied iteratively, with the posterior from one iteration becoming the prior in the following iteration.
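The iterative update is easy to sketch with a conjugate Beta prior; the per-session counts below are invented purely for illustration.

```python
# Start from a uniform Beta(1, 1) prior and update it one session at a
# time: the posterior after each session becomes the next session's prior.
a, b = 1, 1
sessions = [(12, 40), (9, 30), (10, 30)]  # (hits, spins), invented data
for k, n in sessions:
    a, b = a + k, b + (n - k)  # conjugate Beta-binomial update
print(a, b, round(a / (a + b), 3))  # posterior mean of theta → 32 70 0.314
```

Because of conjugacy, updating session by session gives exactly the same posterior as lumping all of the data together.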
Despite the apparent simplicity of this relationship it has historically been rather difficult to apply due to computational challenges.
Note:
The denominator is sometimes called the “Evidence”. It’s just the marginal distribution of \(y\) .
So much for theory. Analytical expressions are rare in practice.
Confounding features:
data are often multi-dimensional;
models have multiple parameters.
So evaluating \(p(\theta|y, X)\) becomes challenging!
One approach to applying Bayes’ Theorem is to use a regular grid of parameter values. This works well and you can see that the shape of the posterior (as represented by points on the grid) evolves with each iteration.
But there’s a major problem with the grid approximation: it scales poorly with the number of model parameters and rapidly becomes intractable.
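For a single parameter the grid approximation takes only a few lines; this sketch uses invented data (31 hits in 100 spins) and a uniform prior.

```python
import numpy as np

k, n = 31, 100                            # invented: 31 hits in 100 spins
theta = np.linspace(0.001, 0.999, 999)    # regular grid of parameter values
prior = np.ones_like(theta)               # uniform prior
likelihood = theta**k * (1 - theta)**(n - k)
posterior = prior * likelihood
posterior /= posterior.sum()              # normalise over the grid
print(round(theta[posterior.argmax()], 3))  # posterior mode → 0.31
```

With d parameters, a grid of g points per dimension needs g to the power d evaluations, which is why the approach becomes intractable so quickly.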
Stan:
RStan:
We’re not going to stress about the details of Markov chains because we’ll be using a remarkable piece of software called Stan.
Stan is a high-level language for writing statistical models. There’s a package for it in R. Both are being aggressively developed.
Note:
It uses Hamiltonian Monte Carlo, which applies some general principles from Physics to provide much more efficient sampling. Yes, it takes a little longer to generate each sample, but those samples are guaranteed to explore parameter space much more effectively.
Hamiltonian Monte Carlo is much more efficient than vanilla Monte Carlo or the Metropolis algorithm.
Stan Workflow
Choose a model.
Write Stan program (likelihood and priors).
Stan parser converts this to C++.
Compile C++.
Execute compiled binary.
Evaluate results. Optionally return to 2.
Inference based on posterior sample.
To use rstan you need a .stan file and a .R script.
Stan Skeleton
We’ll start by building a model to address Maud’s first burning question: what is the hit rate?
This is the basic structure of a Stan file. It consists of a number of blocks, not all of which are required.
The most important blocks are:
data which specifies the observations;
parameters which lists the parameters of the model; and
model which defines the likelihood and priors.
data {
int<lower = 0> N;
int<lower = 0, upper = 1> y[N];
}
parameters {
real<lower = 0, upper = 1> theta;
}
model {
y ~ bernoulli(theta);
}
Our first Stan model has two elements of data: the number of samples and the outcomes of the individual spins, encoded as a binary vector.
We know that theta must lie between 0 and 1.
We specify a Bernoulli likelihood.
We haven’t given a prior on theta, so by default Stan will use a uniform prior: any value of theta is equally likely.
library(rstan)

# Use all available cores.
#
options(mc.cores = parallel::detectCores())

trials <- list(
  N = nrow(spin),
  y = spin$success
)

fit <- stan(
  file = "bernoulli.stan",
  data = trials,
  chains = 2,   # Number of Markov chains
  warmup = 500, # Warmup iterations per chain
  iter = 1000   # Total iterations per chain
)
This is the corresponding R code. Things to note:
it loads the rstan library;
data are transferred to Stan via a list (with names corresponding to those in the Stan file);
Stan is invoked via the stan() function; and
specify the number of chains as well as number of warmup and compute iterations per chain.
The “chains” are initially a little mysterious, but these are basically just independent series of samples. They aid better sampling of parameter space and also facilitate parallelism.
The warmup is a number of iterations during which the distribution should, in principle, achieve an equilibrium.
Note that there are 1000 iterations in total per chain, of which 500 are warmup, which leaves 500 active iterations per chain. So, across the 2 chains, there are 1000 (2 × 500) active iterations in total.
Note:
Set the number of cores to be used for parallel execution (otherwise chains are run in series).
A traceplot shows the sampled values as a function of iteration number.
Let’s take a moment to see where those samples came from. We specified 2 chains, each with 500 warmup iterations and 1000 iterations in total, leaving 500 active iterations per chain. So that accounts for the 1000 samples in the stanfit object.
The chains are uncorrelated, but successive samples within each chain are clearly related to each other. Ideally we are aiming for good “mixing” across the domain of the distribution.
What’s happening during the warmup phase? The parameters of the MC are getting tuned to optimally explore the parameter space (one aspect of this is choosing the step size). By the time that the active iterations start the sampler should have more or less reached an equilibrium.
[1] "stanfit"
attr(,"package")
[1] "rstan"
samples <- as.data.frame(fit)
head(samples)
theta lp__
1 0.3095256 -1226.968
2 0.3072672 -1227.066
3 0.3124851 -1226.912
4 0.3200480 -1227.132
5 0.3140362 -1226.914
6 0.2929604 -1228.812
[1] 1000
We don’t actually get the posterior directly. The stanfit object is really just a collection of samples from the posterior.
Note:
We can access the samples using the extract() function, which yields a list. It’s normally easier to just convert these samples to a data frame.
The number of samples that we get is precisely the total number of active iterations across all chains.
The samples for success probability look like they have a Normal distribution and are centred on the value that we expected from earlier analysis, around 30%.
print(fit, probs = c(0.025, 0.5, 0.975))
Inference for Stan model: bernoulli.
2 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=1000.
mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
theta 0.31 0.00 0.01 0.29 0.31 0.33 413 1
lp__ -1227.43 0.04 0.70 -1229.56 -1227.14 -1226.91 348 1
Samples were drawn using NUTS(diag_e) at Tue May 15 17:30:35 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Use summary() to get information for each chain.
This is very neatly presented in the stanfit summary information.
There’s the mean value of theta as well as the lower and upper quantiles of the distribution. Don’t think of these as a confidence interval: they are quantiles of the posterior density. Close to the median they are stable, but further into the tails they can be noisy.
The average hit rate is in good agreement with our earlier result.
Maud is immediately more comfortable with this result because, even with a uniform prior, the success rate is constrained to a reasonable domain between 0 and 1.
n_eff is the effective sample size. This takes into account the fact that the samples are not independent (a consequence of correlation in the Markov chain). The effective sample size is a vital component of the Monte Carlo version of the Central Limit Theorem.
We can see the degree of correlation in a simple autocorrelation plot.
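To see how autocorrelation eats into the sample count, here is a sketch with a synthetic AR(1) chain (the lag-1 correlation of 0.5 is invented); for such a chain the effective sample size is roughly M(1 - rho)/(1 + rho).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic chain with lag-1 autocorrelation rho (AR(1) process).
rho, M = 0.5, 10_000
x = np.empty(M)
x[0] = rng.normal()
for m in range(1, M):
    x[m] = rho * x[m - 1] + np.sqrt(1 - rho**2) * rng.normal()

rho_hat = np.corrcoef(x[:-1], x[1:])[0, 1]  # estimated lag-1 autocorrelation
n_eff = M * (1 - rho_hat) / (1 + rho_hat)   # AR(1) effective sample size
print(round(rho_hat, 2), int(n_eff))        # ~0.5 and ~3300 of 10000
```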
data {
int<lower=0> N;
int hits[N];
int spins[N];
}
parameters {
real<lower=0,upper=1> theta;
}
model {
hits ~ binomial(spins, theta); // Likelihood
theta ~ beta(2, 2); // Prior
}
Let’s take a look at the session data. We need to treat these data as coming from a binomial distribution.
The data that we are providing consists of two vectors, hits and spins.
We are assuming a binomial relationship between hits and spins with a success probability denoted by the parameter theta.
As an experienced gambler, Maud knows that the hit rate is not uniformly distributed between 0 and 1. We can factor this knowledge into the computation by applying a more informed prior: here, a broad beta distribution peaked at 0.5. The results are essentially unchanged though.
Note:
The ~ represents a stochastic relationship. An = would indicate a deterministic relationship.
The beta distribution is the conjugate prior for the binomial likelihood (this means that the corresponding posterior will also be a beta distribution).
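Conjugacy also gives a closed-form answer to check the sampler against. With a Beta(2, 2) prior and k hits in n spins, the posterior is Beta(2 + k, 2 + n - k); the counts below are illustrative.

```python
# Beta(2, 2) prior + binomial likelihood (k hits in n spins)
# -> Beta(2 + k, 2 + n - k) posterior, by conjugacy.
a, b = 2, 2
k, n = 313, 1000  # illustrative counts, roughly a 31% hit rate
a_post, b_post = a + k, b + n - k
print(a_post, b_post, round(a_post / (a_post + b_post), 3))  # → 315 689 0.314
```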
print(fit, probs = c(0.025, 0.5, 0.975))
Inference for Stan model: binomial-beta-prior.
2 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=1000.
mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
theta 0.31 0.00 0.01 0.29 0.31 0.33 280 1
lp__ -1228.96 0.03 0.75 -1231.15 -1228.66 -1228.45 545 1
Samples were drawn using NUTS(diag_e) at Tue May 15 17:31:38 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
The nature of MCMC is that successive samples of the parameters are correlated. As a result the samples cannot be considered as independent and so we need to adjust the sample count accordingly.
Let’s move onto Maud’s second burning question, quantifying the RTP.
In an ideal world RTP would be close to 1. However, this is not the case. A lower RTP obviously favours the casino.
Looking at the distribution of RTP values we can see that there is appreciable variation between sessions. We’d like to quantify the mean RTP and get an idea of the associated uncertainties.
Now this distribution is definitely not Normal. The long tail is a dead giveaway. Maybe it could be described by a log-Normal though? This makes sense too based on the multiplicative nature of slot machine returns.
data {
int<lower=0> N;
real rtp[N];
}
parameters {
real<lower = 0> mu;
real<lower = 0> sigma;
}
model {
rtp ~ lognormal(log(mu) - sigma * sigma / 2, sigma);
}
generated quantities {
real simulated[25];
for (i in 1:25) {
simulated[i] = lognormal_rng(log(mu) - sigma * sigma / 2, sigma);
}
}
Now our Stan code specifies a log-Normal for the likelihood, with two parameters, mu and sigma.
We’re not going to specify a prior for either of the model parameters, but simply constrain them to being positive real numbers.
Note:
The mean of lognormal() is exp(mu + sigma * sigma / 2).
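That’s why the location is written as log(mu) - sigma * sigma / 2: it makes mu the mean of the distribution, which is easy to verify by simulation (the parameter values below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.8, 0.72  # arbitrary illustrative values

# With location log(mu) - sigma^2 / 2 the log-Normal mean is exactly mu.
draws = rng.lognormal(np.log(mu) - sigma**2 / 2, sigma, size=1_000_000)
print(round(draws.mean(), 2))  # ~0.8
```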
It’s good practice to do a posterior predictive check. We do this by generating simulated samples from the posterior distribution. These samples are specified in the generated quantities block. It looks like there is pretty good agreement between the data and the model.
These simulated values take into account the uncertainties in the model parameters too.
Inference for Stan model: lognormal-rtp.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
mu 0.80 0.00 0.07 0.69 0.76 0.80 0.84 0.94 1901 1
sigma 0.72 0.00 0.05 0.62 0.68 0.71 0.75 0.83 1852 1
lp__ -16.12 0.02 1.01 -18.77 -16.48 -15.81 -15.41 -15.16 1759 1
Samples were drawn using NUTS(diag_e) at Tue May 15 17:32:48 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
The average RTP is around 80%.
Here’s the Kicker
What’s the probability of breaking even?
[1] 0.25117
What’s the probability of doubling your money?
[1] 0.05319
Since we have access to the posterior distribution we are able to calculate all manner of statistics without relying on (poor) asymptotic approximations. For example, we can find the probability of breaking even or doubling our money.
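Each probability is just the proportion of posterior (predictive) draws above a threshold. A sketch with the parameters fixed at their posterior means, so the numbers differ slightly from the figures above, which also include parameter uncertainty:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma = 0.8, 0.72  # posterior means from the RTP model

# Simulated RTP draws from the fitted log-Normal.
rtp = rng.lognormal(np.log(mu) - sigma**2 / 2, sigma, size=100_000)

print(round((rtp >= 1).mean(), 2))  # P(break even) ~0.25
print(round((rtp >= 2).mean(), 2))  # P(double your money) ~0.05
```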
These figures are enough to make Maud reconsider her gambling fixation.
Q3: Hit Rate per Combination
description   symbols    payout
                         0
1x mouse      🐭         1
1x cat        🐱         2
2x mouse      🐭🐭       5
2x cat        🐱🐱       10
3x mouse      🐭🐭🐭     20
3x cat        🐱🐱🐱     100
data {
int<lower=0> N;
real rtp[N];
}
parameters {
real<lower=0, upper=1> theta[6];
real<lower=0> sigma;
}
transformed parameters {
real<lower=0> mu;
mu = theta[1] * 1 + // Payline 1 -> 1x
theta[2] * 2 + // Payline 2 -> 2x
theta[3] * 5 + // Payline 3 -> 5x
theta[4] * 10 + // Payline 4 -> 10x
theta[5] * 20 + // Payline 5 -> 20x
theta[6] * 100; // Payline 6 -> 100x
}
model {
rtp ~ lognormal(log(mu) - sigma * sigma / 2, sigma);
theta[1] ~ beta(2, 2); // Mode = 0.5
theta[2] ~ beta(2, 4); // Mode = 0.25
theta[3] ~ beta(2, 5); // Mode = 0.2
theta[4] ~ beta(2, 8); // Mode = 0.125
theta[5] ~ beta(2, 10); // Mode = 0.1
theta[6] ~ beta(2, 16); // Mode = 0.0625
}
Maud got the idea for this while pruning her roses. The idea is to use the observations of RTP per session to infer the relative frequency of the winning combinations. Since we know the payout for each combination it seems like this might be possible, but I was not convinced that we had enough information.
We’ve defined a vector of rate parameters, theta, and another parameter, mu, which is derived from them. The relationship shows up in the transformed parameters block and the six different payouts are used as weights.
The priors are based on Maud’s gut feel for the relative frequency of each of the payouts. They carry a little information about the probabilities of each payline, but not a lot.
The traceplot paints an interesting picture. It’s readily apparent here the importance of the warmup iterations: the first few samples for some of the parameters are very far from the core of their distribution. This could be improved by providing more restrictive priors.
mean se_mean sd 2.5% 97.5% n_eff Rhat
theta[1] 0.140072477 1.445106e-03 0.0886783678 0.0186286589 0.352326667 3765.615 0.9993099
theta[2] 0.068064522 6.711297e-04 0.0424459690 0.0102927430 0.168731470 4000.000 1.0001725
theta[3] 0.028892776 2.894141e-04 0.0183041579 0.0033953293 0.072914250 4000.000 0.9998855
theta[4] 0.014594931 1.437237e-04 0.0090898869 0.0019018620 0.036523596 4000.000 0.9999596
theta[5] 0.007441395 7.525103e-05 0.0047592927 0.0009537422 0.019112895 4000.000 1.0000193
theta[6] 0.001517744 1.482361e-05 0.0009375275 0.0002257116 0.003729426 4000.000 1.0005894
The 1x payout is triggered on average every 7 spins.
The 100x payout is triggered on average only every 659 spins.
The results are interesting. We’ve derived values for the frequency of each of the winning combinations. Granted there is some uncertainty in those values, but not much.
The small payouts are relatively frequent (designed to keep you playing the game) but the big payouts are rather rare.
These are the vital statistics for determining the characteristics of the game. So we have effectively reverse engineered it.
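The spins-per-payout figures follow directly from the posterior means in the table above: the expected number of spins between payouts is just 1 / theta.

```python
# Posterior means for the 1x and 100x paylines (from the summary above).
theta_1x = 0.140072477
theta_100x = 0.001517744

print(round(1 / theta_1x))    # → 7 spins per 1x payout, on average
print(round(1 / theta_100x))  # → 659 spins per 100x payout, on average
```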
Metropolis-Hastings Algorithm
Randomly sample \(\theta^{(0)}\) .
Set \(i = 1\) .
Randomly sample proposal \(\theta^{*}\) in the vicinity of \(\theta^{(i-1)}\) .
Sample \(u\) uniformly on \([0, 1)\) .
\[
\theta^{(i)} = \left\{\begin{array}{ll}
\theta^{*} & \text{if } u \cdot p(\theta^{(i-1)}|y, X) < p(\theta^{*}|y, X) \\
\theta^{(i-1)} & \text{otherwise.}
\end{array}\right.
\]
Increment \(i\) .
Return to 1.
The Metropolis-Hastings Algorithm is the classical implementation of MCMC.
The algorithm proceeds by sampling a proposal point in the vicinity of the current point and then either accepting or rejecting that point depending on the ratio of the probability density at the two points. If the probability of the new point is higher then we always accept. If it’s lower then we accept with a probability that is the ratio of the densities.
It might occur to you that the samples are now no longer independent. And you’d be completely correct. We’ll see how this is accounted for.
Metropolis Monte Carlo is better than the Grid Approximation. However in an absolute sense it’s not really all that good. Especially not in higher dimensions, where it can take a very long time to sample the parameter space.
Note:
The size of the “vicinity” (effectively the step size) determines the acceptance rate of new points. If it’s too large then few new points will be accepted. If it’s too small then the rate of diffusion will be painfully slow.
Without step 3 this would essentially give us diffusion, a random walk. But this step, the Metropolis Acceptance Procedure, ensures that sampling converges to the underlying distribution.
The principal advantage of MCMC is that we don’t need to know the maximum value of the posterior.
Also it only depends on the ratio of posteriors, so these do not need to be normalised.
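The whole loop fits in a few lines. This sketch targets a toy unnormalised posterior (31 hits in 100 spins with a flat prior, invented data), using the same acceptance rule as above:

```python
import numpy as np

rng = np.random.default_rng(0)

def unnorm_posterior(theta, k=31, n=100):
    """Unnormalised binomial posterior with a flat prior (toy data)."""
    if theta <= 0 or theta >= 1:
        return 0.0
    return theta**k * (1 - theta)**(n - k)

theta = 0.5  # initial sample
samples = []
for _ in range(20_000):
    proposal = theta + rng.normal(0, 0.05)  # propose near the current point
    u = rng.uniform()
    # Metropolis acceptance: keep the proposal if u * p(old) < p(new).
    if u * unnorm_posterior(theta) < unnorm_posterior(proposal):
        theta = proposal
    samples.append(theta)

# Discard the first quarter as warmup; the mean should settle near 0.31.
print(round(np.mean(samples[5_000:]), 2))
```

Note that only the ratio of densities matters, so the normalising constant never appears.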
Poisson
# trials <- list(
# N = nrow(sessions),
# spins = sessions$spins
# )
#
# fit <- stan(
# file = "poisson.stan",
# data = trials,
# chains = 4,
# warmup = 1000,
# iter = 2000
# )
# extract(fit)
# hist(extract(fit)$lambda)
Regression
# ggplot(sessions, aes(spins, wager)) + geom_jitter()
Regression Stan
data {
int<lower=0> N;
vector[N] y; // Total wager
vector[N] x; // Number of spins (standardised)
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
y ~ normal(alpha + beta * x, sigma); // Likelihood
//
alpha ~ normal(0, 10); // Prior (Normal)
beta ~ normal(0, 10); // Prior (Normal)
sigma ~ cauchy(0, 5); // Prior (Cauchy)
}
Because the number of spins has been standardised, the intercept will be the wager for the average number of spins.
Regression Stan R
In a linear model we are saying that the response is normally distributed with a mean that depends on the covariates and a constant standard deviation: we are assuming homogeneous noise.
The half-Cauchy is essentially a one-sided, heavy-tailed version of the Normal distribution. It constrains the value of sigma to be positive but is not as restrictive as a Normal prior.
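The likelihood line corresponds to a simple generative story, which we can simulate to convince ourselves the model is identifiable (all parameter values below are invented):

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulate from the regression likelihood: y ~ Normal(alpha + beta * x, sigma).
alpha, beta, sigma = 2.0, 1.5, 0.5  # invented "true" values
x = rng.normal(size=1000)           # standardised covariate
y = rng.normal(alpha + beta * x, sigma)

# A least-squares fit recovers the intercept and slope.
slope, intercept = np.polyfit(x, y, 1)
print(round(intercept, 1), round(slope, 1))  # ~2.0 and ~1.5
```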
… the theory of inverse probability is founded upon error and must be wholly rejected. Sir Ronald Fisher (1925)
Not everybody’s a fan of the Bayesian approach. Ronald Fisher had this to say about “inverse probability”, as Bayesian inference was known in his day.
Possibly one of the reasons for such skepticism is that historically it has been computationally challenging to actually apply these techniques. This is no longer true.
In general you can’t apply Bayesian techniques analytically.
\[
\text{hit rate} = \theta^* = \frac{\text{total hits}}{\text{total spins}} = \frac{\sum_i k_i}{\sum_i n_i}
\]
with(sessions, sum(hits) / sum(spins))
[1] 0.3128803
Well \(\theta\) is just the expected number of hits per spin. So we could lump all of the data together and treat it as a single massive experiment, taking the ratio of the total number of hits to the total number of spins across all sessions.
This would certainly maximise the amount of data and should, in principle, give us a reasonably accurate result. But there is no way to quantify our uncertainty.
\[
\text{hit rate for session } i = \theta^*_i = \frac{k_i}{n_i}
\]
# A tibble: 1 x 2
session_avg session_std
<dbl> <dbl>
1 0.3100018 0.1030164
95% confidence interval extends from 28.9% to 33.1%.
Another approach is to treat each session as an independent experiment, each of which yields an estimate for the hit rate. Now we can calculate both mean and variance, and hence a confidence interval on the hit rate.
Maud is fairly happy with this result, but it does concern her that the confidence interval on the hit rate could, in principle, extend to values less than zero or greater than one. Neither of which is physically possible.
Assuming that sessions \(i = 1, 2, 3, \ldots\) are independent:
\[
P(k | n, \theta) = \prod_i P(k_i | n_i, \theta) = L(\theta; n, k)
\]
Let’s consider another approach: the joint probability distribution for all of the sessions lumped together. If we assume that they are independent, which is a reasonable assumption, then we can simply multiply together the distributions for each session. This gives us, on the one hand, a distribution for the number of hits in each of the sessions. But if we think of this as a function of \(\theta\) then it’s the likelihood of a particular value of \(\theta\) given observed values of \(k_i\) and \(n_i\) .
If we maximise this function then we have the most likely value of \(\theta\) based on the data.
Note:
\(P(k | n, \theta)\) and \(L(\theta; n, k)\) are functionally equivalent but
\(P(k | n, \theta)\) is a function of \(k\) (and is a probability distribution) and
\(L(\theta; n, k)\) is a function of \(\theta\) (and is not a probability distribution).
Log-likelihood for binomial process (multiple trials):
\[
LL(\theta; n, k) = \sum_i k_i \log{\theta} + (n_i - k_i) \log{(1 - \theta)}
\]
However we’d be multiplying together a collection of small numbers, which would mean that we would rapidly lose precision. Instead we work with the log likelihood.
Note:
Neglect the binomial coefficient since it’s a constant (as a function of theta).
Because the logarithm is a strictly increasing function, the log likelihood is maximised in the same place as the likelihood.
# A tibble: 6 x 2
theta log_likelihood
<dbl> <dbl>
1 0.31 -1225.411
2 0.32 -1225.604
3 0.30 -1226.146
4 0.33 -1226.692
5 0.29 -1227.843
6 0.34 -1228.649
If we plot the log-likelihood as a function of \(\theta\) then it’s a simple matter to identify the most probable value.
In the Maximum Likelihood approach we simply select the value of the parameter which gives the largest likelihood.
This gives us a point estimate for \(\theta\) with no indication of uncertainty. However, the likelihood is a tool which will allow us to get a much better characterisation of the hit rate.
Note:
The assumption implicit in this approach is that in the absence of data all parameter values are equally probable.
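The maximisation can be sketched by evaluating the log-likelihood on a grid; the session counts below are invented so that the pooled hit rate is 31%.

```python
import numpy as np

# Invented sessions: (hits k_i, spins n_i); pooled hit rate = 62/200 = 0.31.
sessions = [(12, 40), (9, 30), (10, 30), (31, 100)]

def log_likelihood(theta):
    """Binomial log-likelihood, dropping the constant binomial coefficients."""
    return sum(k * np.log(theta) + (n - k) * np.log(1 - theta)
               for k, n in sessions)

grid = np.linspace(0.01, 0.99, 99)
ll = np.array([log_likelihood(t) for t in grid])
print(round(grid[ll.argmax()], 2))  # maximum likelihood estimate → 0.31
```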
Maud’s Bayesian Answers
What is the hit rate? 31.3%
What is the RTP? 80.1%
What is the hit rate for each winning combination?
frequent low paying combinations (every 7 spins for 1x)
rare high paying combinations (every 659 spins for 100x)
So, returning to Maud’s three burning questions, not only have we managed to obtain point estimates to answer each of them, but we have actually generated distributions for each of these quantities, which are enormously more informative. With this kind of information in hand it becomes relatively simple to reverse engineer such a game.
Monte Carlo (Plain Vanilla Version)
Generate independent samples of \(\theta^{(i)}\) .
Need to have \(p(\theta^{(m)} | y, X)\) .
Monte Carlo is an alternative technique which generates independent, random samples from the parameter space.
Although it’s better in higher dimensions than the grid approximation, it’s still far from perfect.
However there’s yet another alternative…
Note:
The objective is to jump around parameter space in such a way that the probability of being at a point is proportional to the distribution.
Efficient algorithms exist for doing this with simple distributions. However, any non-trivial distribution will generally require you to apply something like rejection sampling. And to make that efficient we’d need to know the maximum value of the distribution.
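Rejection sampling, for reference, looks like this; the target below is a toy unnormalised Beta(3, 8)-shaped density, and note that we need its maximum (at the mode theta = 2/9) up front:

```python
import numpy as np

rng = np.random.default_rng(3)

def target(theta):
    """Unnormalised Beta(3, 8)-shaped density."""
    return theta**2 * (1 - theta)**7

m = target(2 / 9)  # maximum of the density, needed for rejection sampling

samples = []
while len(samples) < 50_000:
    theta = rng.uniform()          # propose uniformly on [0, 1)
    if rng.uniform() * m < target(theta):
        samples.append(theta)      # accept with probability target / m

print(round(np.mean(samples), 2))  # Beta(3, 8) mean is 3/11 ~ 0.27
```

The acceptance rate is the ratio of the area under the target to m; for peaky or high-dimensional targets that ratio collapses, which is part of what motivates MCMC.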
Markov Chain Monte Carlo (MCMC)
Generate a series of samples: \(\theta^{(1)}\) , \(\theta^{(2)}\) , \(\theta^{(3)}\) , … \(\theta^{(M)}\) .
\(\theta^{(m)}\) depends on \(\theta^{(m-1)}\) .
Need to have \(p(\theta^{(m)} | \theta^{(m-1)}, y, X)\) .
Markov Chain Monte Carlo generates a sequence of samples where each new sample is random but also dependent on the previous sample.
ShinyStan
library(shinystan)
launch_shinystan(fit)
Note:
Diagnose -> PPcheck performs a “Posterior predictive check”: look at the distribution of the observed data versus the replications. Do they look similar? “If our model is a good fit then we should be able to use it to generate data that looks a lot like the data we observed.”